VQ-based written language identification

Authors

  • Tuan Pham
  • Dat Tran
Abstract

Humans can recognize different written languages by their grammars and vocabularies, whereas computers represent everything as numbers. We present a computational algorithm for machine classification of written languages using vector quantization (VQ). For a document in a given language, each word is converted, character by character, into a sequence of numbers that forms a vector of numerical values. The resulting collection of vectors is then represented by a codebook containing a number of template vectors used for classification. The proposed method is more effective for machine learning than the n-gram based method, which has been widely used for written language identification. Experimental results on classifying a set of five closely related Roman-typed scripts show the promise of the proposed method.
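The pipeline described in the abstract (words to numeric vectors, one codebook per language, classification by codebook fit) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the character encoding, the vector length, the distance measure, or the codebook training procedure. The sketch assumes Unicode code points, a fixed length of 8 with zero padding, squared Euclidean distance, and plain k-means training. All function names and the `WORD_LEN` constant are hypothetical.

```python
import random

WORD_LEN = 8  # assumed fixed vector length; not specified in the abstract


def word_to_vector(word, length=WORD_LEN):
    """Map a word to a fixed-length vector of character codes (zero padded)."""
    codes = [float(ord(c)) for c in word.lower()[:length]]
    return codes + [0.0] * (length - len(codes))


def text_to_vectors(text):
    """Convert a whitespace-separated document into a list of word vectors."""
    return [word_to_vector(w) for w in text.split()]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iters=20, seed=0):
    """Build a codebook of `size` template vectors with plain k-means.

    Requires size <= len(vectors).
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: sq_dist(v, codebook[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[i] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook)
               for v in vectors) / len(vectors)


def classify(text, codebooks):
    """Return the language whose codebook gives the lowest distortion."""
    vectors = text_to_vectors(text)
    return min(codebooks, key=lambda lang: distortion(vectors, codebooks[lang]))


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    samples = {
        "en": "the quick brown fox jumps over lazy dog",
        "es": "el rapido zorro marron salta sobre perro vago",
    }
    codebooks = {lang: train_codebook(text_to_vectors(text), size=8)
                 for lang, text in samples.items()}
    print(classify("the lazy dog", codebooks))
```

In this sketch a document is assigned to the language whose codebook reproduces its word vectors with the smallest average distortion. A real system would train on much larger corpora and choose the codebook size empirically.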




Publication date: 2003